CIS 634 Information Retrieval 

Distance Learning Lecture 6 Part 2 



Materials: 

■ An Introduction to Neural Networks, Ch 1, by Kevin
Gurney http://www.shef.ac.uk/psychology/gurney/notes/
■ Papers from AI Lab, and WebSOM research centers.
(These papers are available through links on the syllabus.)



Copyright © 2002 All Rights Reserved

■ Create a document collection based on your
RE1. Please remember to remove duplicates.

■ Create a vocabulary file with default lexical
options for your own document collection.

■ Create a document-term matrix.

■ Perform document classification with SOM.

■ Interpret results.

■ Present results in cis634re2.html.

■ Create a directory named "model4" under the
~yourusername/public_html/cis634 directory.

■ Create a directory named "mycollection"
under the ~yourusername/public_html/cis634
directory.

■ Create group1 and group2 sub-directories
under "mycollection."

Creating document collections

■ Copy the top 25 retrieved documents from
model1 of your RE1 into mycollection/group1.

■ Copy the top 25 retrieved documents from
model2 of your RE1 into mycollection/group2.
(Remove duplicates; remember to
check each document number to see if it
is already in group1. If yes, do not copy

■ Copying retrieved documents into
mycollection/group1:

■ Make sure you are in
~yourusername/public_html/cis634/mycollection/group1

Creating document collections

■ At the system prompt, type in: cp filename .

■ . (the dot sign) means the current directory

■ ex: cp /afs/cad/u/w/u/wu/cis634/tc/lisa/text/group0/doc_306 .

■ Repeat the same process for group2, which
contains retrieved documents from model2 of
RE1 (remember to remove duplicates).
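
The duplicate check for group2 can also be scripted. The following is a minimal Python sketch and not part of the original assignment: the function name copy_without_duplicates and its parameters are hypothetical, and it assumes that a document's file name (for example, doc_306) is its document number.

```python
import shutil
from pathlib import Path


def copy_without_duplicates(sources, dest_dir, existing_dirs=()):
    """Copy source files into dest_dir, skipping any file whose name
    (assumed to be the document number) already exists in dest_dir
    or in one of existing_dirs. Returns the names actually copied."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Document numbers that are already present somewhere in the collection.
    seen = {p.name for p in dest.iterdir() if p.is_file()}
    for directory in existing_dirs:
        seen |= {p.name for p in Path(directory).iterdir() if p.is_file()}

    copied = []
    for src in map(Path, sources):
        if src.name in seen:
            continue  # duplicate document number: do not copy
        shutil.copy(src, dest / src.name)
        seen.add(src.name)
        copied.append(src.name)
    return copied
```

For group2, one would pass the list of model2 documents as sources, mycollection/group2 as dest_dir, and the group1 directory in existing_dirs.
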

Creating document collections

After you have created the document collection,
execute these 2 commands:

■ more ~yourusername/public_html/cis634/mycollection/group1/* > ~yourusername/public_html/cis634/model4/group1.txt

■ more ~yourusername/public_html/cis634/mycollection/group2/* > ~yourusername/public_html/cis634/model4/group2.txt

Using Rainbow to Create

Remember the test collection now is in
~yourusername/public_html/cis634/mycollection/*

■ Go to the BOW directory; at the system
prompt, type in:

./rainbow -d ~yourusername/public_html/cis634/model4 --index ~yourusername/public_html/cis634/mycollection/*

Printing the D-T Matrix

■ Only the top 5 terms (based on info-gain)
from the vocabulary lists are selected.

■ Type in the following at the system prompt:

./rainbow -d ~yourusername/public_html/cis634/model4 --prune-vocab-by-infogain=5 --print-matrix=abe > ~yourusername/public_html/cis634/model4/matrix

■ Check the BOW web site to see what "abe"
means.

■ ftp the matrix file to your PC.

■ Open it with Excel, select "delimited," and select
"space" as the delimiter.

■ Delete the 2nd column (the class name).

■ Move the document number (1st column) to the last
column.

■ Delete paths in the last column (before the actual
document numbers).

■ Only document numbers and frequency counts are
left.

■ Insert an empty row before the first row.
Type in 5 (for 5 properties) in the very first
cell.

■ Save the file in "Text" (Tab delimited) format;
the file name is matrix.txt

■ Upload this file back to the model4 directory.

■ ****However, Nenet uses *.dat for input
matrix files. You will have to specify the file
type as "all files" when opening the data file in
Nenet.
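
The Excel steps above (drop the class column, move the document number to the end, strip the path, add a first row holding the number of properties) can be sketched in Python as an alternative. This is an illustration under assumptions, not the course's required procedure: the function name rainbow_to_nenet is hypothetical, and the input layout (document path, class name, then one count per term, separated by spaces) is inferred from the column descriptions on the slides rather than from the BOW documentation.

```python
import os


def rainbow_to_nenet(matrix_text, n_properties=5):
    """Convert matrix text of the assumed form
        <document path> <class name> <count 1> ... <count N>
    into tab-delimited text: a first row with the number of properties,
    then one row per document with the counts followed by the document number."""
    rows = []
    for line in matrix_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        doc_path = fields[0]
        counts = fields[2:]  # fields[1] is the class name, which is dropped
        if len(counts) != n_properties:
            raise ValueError(
                f"expected {n_properties} counts, got {len(counts)}: {line!r}"
            )
        doc_number = os.path.basename(doc_path)  # delete the path before the number
        rows.append("\t".join(counts + [doc_number]))
    return "\n".join([str(n_properties)] + rows) + "\n"
```

The returned text could then be written to matrix.txt in the model4 directory.
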

The trial version is available at:

http://koti.mbnet.fi/~phodju/nenet/Nenet/General.html

The trial version has limited capability: up to 8
properties, 6x6 dimensions (36 neurons).
However, if the matrix has 8 properties,
Nenet seems to have trouble with it. So,
please limit your raw data (matrix and
matrix.txt files) to exactly 5 properties.

■ Create a file folder named temp.
Download all three zip files to temp and
uncompress them with WinZip. Install
the software by clicking on setup.exe.

■ If your PC doesn't have WinZip,
download it here:

http://www.winzip.com/

Nenet Demo & Dataset

■ Interactive demo:

■ For your RE2, the initial dataset, training dataset, and
test dataset are the same, that is, "matrix.txt".

■ Nenet uses *.dat for input matrix files. You will have
to specify the file type as "all files" when opening the
data file in Nenet.

■ Remember to select "Use Automatic Labeling" at the
testing stage (or your map will not have document
numbers as labels!!)

View Results on the Feature Map

■ After training and testing, Nenet presents the
results on a map similar to slide 19.

■ Click on "view," "labels and vectors." Nenet
will bring you to a screen similar to slide 20.

■ Click on any neuron on the map that has
document numbers on it; you will see a list of
document numbers associated with that
neuron.

How are Doc# mapped to the

■ When labeling, each document vector is
compared to the final weight vector
of each neuron.

■ The best matching neuron determines
where the document # will be located
on the map.
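
The labeling step can be illustrated with a short Python sketch. It assumes that "best matching" means the smallest Euclidean distance, which is the usual choice for SOM but is not confirmed here for Nenet; the function names and the dictionary layout for the weights are hypothetical.

```python
import math


def best_matching_neuron(doc_vector, weights):
    """Return the grid position of the neuron whose final weight vector
    is closest (Euclidean distance, an assumption) to doc_vector.
    weights maps (row, col) -> weight vector."""
    return min(weights, key=lambda pos: math.dist(doc_vector, weights[pos]))


def label_map(documents, weights):
    """Attach each document number to its best matching neuron.
    documents maps document number -> document vector.
    Returns a dict mapping (row, col) -> list of document numbers."""
    labels = {}
    for doc_number, vector in documents.items():
        position = best_matching_neuron(vector, weights)
        labels.setdefault(position, []).append(doc_number)
    return labels
```

Because several documents can share the same best matching neuron, a neuron may end up with more than one document number as its label.
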

Copy Map to Clipboard

Save this map in matrix.jpg or matrix.gif




■ Use the D-T matrix (matrix.txt) created 
earlier for document clustering with Nenet. 



■ Follow the instructions on the interactive 

demo. 

■ Save the final results in the matrix.cod file.

■ Upload matrix.txt, matrix.cod, and
matrix.jpg to the model4 directory.

■ Create RE2 page, format:

The Nenet trial version has limited capabilities: up
to 8 properties, and 2000 records.

An alternative: SOM_PAK does not have
restrictions on the size of datasets. The
original D-T matrix for the output map in slide
26 has 25 properties (terms).

SOM_PAK is located at: ~wu/IR_Tools/som

However, the PostScript map files created by
SOM_PAK are hard to read.

An Alternative to the Alternative

■ Use SOM_PAK for the whole process and create the
matrix.cod output map file. (Save the file in your
model4 directory.)

■ Download matrix.cod to your PC, and read the
file with Nenet. (File > Open > matrix.cod)

■ No instructions on SOM_PAK will be provided. You
will have to read the SOM_PAK manual by yourself.

■ However, those who use SOM_PAK to process a D-T
matrix with a higher number of properties will receive
2 extra points.

Directory structure of your UNIX account should look like this

Note: ~wu should be ~yourusername

~wu/public_html/cis634

~wu/public_html/cis634/mycollection

group1

~wu/public_html/cis634/

group2

vocabulary

~wu/public_html/cis634/cis634re1.html

~wu/public_html/cis634/cis634re2.html

(mycollection has two sub-directories:
group1 stores retrieved documents from
model1 of RE1, and group2 stores those
from model2)

Oct/04/2001

group1.txt group2

What are the differences between
the results from RE2 and WebSOM?

■ RE2: a neuron can have multiple document #
associated with it, namely, many labels.

■ WebSOM: each area is labeled with one term
only.

■ Note: When talking about term space, researchers
tend to use "terms" and "concepts"
interchangeably.

■ What makes them different?

Keyword Selection for Term

■ Maps created by the WebSOM group and AI Lab can be
viewed as concept maps.

■ Each area (not neuron) on the concept map
represents a major concept.

■ Select one term only from the terms associated with a
map area to be the label.

■ It is less useful to assign a document as the label
on a document map.

■ Reason: Terms as labels are self-explanatory, but document
# are not.

■ For example: a list of terms that could be inside the
same area on the SOM feature map:

automatic classification, term classification,
document classification, clustering, k-means,
hierarchical clustering, document space, concept
space, etc.

■ In this case, automatic classification could be the
best candidate to be the label of this area.

■ The WebSOM study extended the algorithm to select
representative terms as labels.

How to Use SOM for 2nd Type

■ The initialization and training process is the
same.

■ The only difference is the testing part:
use a different D-T matrix.

■ How can the resultant maps be used?
■ Automatic Cataloging